9 research outputs found

    Fast matching statistics in small space

    Get PDF
    Computing the matching statistics of a string S with respect to a string T on an alphabet of size sigma is a fundamental primitive for a number of large-scale string analysis applications, including the comparison of entire genomes, for which space is a pressing issue. This paper takes from theory to practice an existing algorithm that uses just O(|T|log{sigma}) bits of space, and that computes a compact encoding of the matching statistics array in O(|S|log{sigma}) time. The techniques used to speed up the algorithm are of general interest, since they optimize queries on the existence of a Weiner link from a node of the suffix tree, and parent operations after unsuccessful Weiner links. Thus, they can be applied to other matching statistics algorithms, as well as to any suffix tree traversal that relies on such calls. Some of our optimizations yield a matching statistics implementation that is up to three times faster than a plain version of the algorithm, depending on the similarity between S and T. In genomic datasets of practical significance we achieve speedups of up to 1.8, but our fastest implementations take on average twice the time of an existing code based on the LCP array. The key advantage is that our implementations need between one half and one fifth of the competitor\u27s memory, and they approach comparable running times when S and T are very similar

    Fast algorithms for computing sequence distances by exhaustive substring composition

    Get PDF
    The increasing throughput of sequencing raises growing needs for methods of sequence analysis and comparison on a genomic scale, notably, in connection with phylogenetic tree reconstruction. Such needs are hardly fulfilled by the more traditional measures of sequence similarity and distance, like string edit and gene rearrangement, due to a mixture of epistemological and computational problems. Alternative measures, based on the subword composition of sequences, have emerged in recent years and proved to be both fast and effective in a variety of tested cases. The common denominator of such measures is an underlying information theoretic notion of relative compressibility. Their viability depends critically on computational cost. The present paper describes as a paradigm the extension and efficient implementation of one of the methods in this class. The method is based on the comparison of the frequencies of all subwords in the two input sequences, where frequencies are suitably adjusted to take into account the statistical background

    The Genome Sequence of the Leaf-Cutter Ant Atta cephalotes Reveals Insights into Its Obligate Symbiotic Lifestyle

    Get PDF
    Leaf-cutter ants are one of the most important herbivorous insects in the Neotropics, harvesting vast quantities of fresh leaf material. The ants use leaves to cultivate a fungus that serves as the colony's primary food source. This obligate ant-fungus mutualism is one of the few occurrences of farming by non-humans and likely facilitated the formation of their massive colonies. Mature leaf-cutter ant colonies contain millions of workers ranging in size from small garden tenders to large soldiers, resulting in one of the most complex polymorphic caste systems within ants. To begin uncovering the genomic underpinnings of this system, we sequenced the genome of Atta cephalotes using 454 pyrosequencing. One prediction from this ant's lifestyle is that it has undergone genetic modifications that reflect its obligate dependence on the fungus for nutrients. Analysis of this genome sequence is consistent with this hypothesis, as we find evidence for reductions in genes related to nutrient acquisition. These include extensive reductions in serine proteases (which are likely unnecessary because proteolysis is not a primary mechanism used to process nutrients obtained from the fungus), a loss of genes involved in arginine biosynthesis (suggesting that this amino acid is obtained from the fungus), and the absence of a hexamerin (which sequesters amino acids during larval development in other insects). Following recent reports of genome sequences from other insects that engage in symbioses with beneficial microbes, the A. cephalotes genome provides new insights into the symbiotic lifestyle of this ant and advances our understanding of host–microbe symbioses

    Knowledge Specialization in Ph.D. Student Groups

    No full text
    Researchers have argued that specialization within groups yields productivity gains. We evaluate this statement with a focus on groups of Ph.D. students. Using an established technique in computer science called Latent Dirichlet Allocation, we construct a novel measure of the dispersion of Ph.D. students' research interests based on their dissertation abstracts. We then relate this measure to Ph.D. group publications. For our study, we use a rich dataset on groups of Ph.D. students who studied at a major Swiss University, during the 1993-2008 period. We find robust evidence that within-group knowledge specialization is associated with a larger number of publications. However, when specialization increases beyond a critical level, it hinders the group's publication output. We interpret these results as an indication that gains, in the amount of research output, can be achieved if Ph.D. students specialize according to their comparative advantages. However, beyond a certain level, knowledge specialization has a detrimental impact on research output, due to increasing communication costs and an increased likelihood of conflict insurgence

    An encyclopedia of mouse DNA elements (Mouse ENCODE)

    No full text
    To complement the human Encyclopedia of DNA Elements (ENCODE) project and to enable a broad range of mouse genomics efforts, the Mouse ENCODE Consortium is applying the same experimental pipelines developed for human ENCODE to annotate the mouse genome
    corecore